Abstract

We outline how utterances in dialogs can be interpreted using a partial first order logic. 
We exploit the capability of this logic to talk about the truth status of formulae to define a 
notion of coherence between utterances and explain how this coherence relation can serve for 
the construction of AND/OR trees that represent the segmentation of the dialog. In a BDI 
model we formalize basic assumptions about dialog and cooperative behaviour of participants. 
These assumptions provide a basis for inferring speech acts from coherence relations between 
utterances and attitudes of dialog participants. Speech acts prove to be useful for determining
dialog segments defined on the notion of completing expectations of dialog participants.
Finally, we sketch how explicit segmentation signalled by cue phrases and performatives is 
covered by our dialog model. 



1 Introduction 



During the last years, a large number of spoken language dialog systems have been developed
whose functionality is normally restricted to a certain application domain. [SadMor97] give a
quite extensive overview of existing implementations.

Only a few systems for generating dialog managers exist or are currently under development.
These tools identify task and discourse structure and describe it by means of finite state
automata. Using these tools one can easily and quickly implement spoken language human-machine
communication for simple tasks. Nevertheless, this approach lacks theoretical sufficiency for a
large number of phenomena occurring frequently in natural language dialogs. [SadMor97] state
that "these limitations rule out these approaches as a basis for computational models of intelligent
interaction".

Recently, there has been some research on extracting dialog structure out of annotated corpora
([Moe97]); algorithms for learning probability distributions of speech acts are used in this case. The
estimated distributions serve as a basis for generating stochastic models for sequences of speech
acts. But in this case, exploring common elements of dialogs in different domains is replaced
by an abstract optimization process, although knowledge of these elements could be useful for
improving parameter estimation.

On the other hand, many approaches to dialog processing consider different structural elements 
important to gain a deeper understanding of the effects that utterances have on the dialog itself 
and its participants. But these approaches do not take spoken language and the problems related 
to speech recognition into account.

Following the opinion of [Poe94], we consider the separation of describing and described situation
to be a crucial point for dialog processing. This reflection is backed up by philological and linguistic
research on discourse ([pie87], [Mar92], [BriGoe82], [pun97]).

On this basis we present fundamental elements of dialog structure to handle even spoken 
language. The main aim of this paper is to build a bridge between research on spoken language 
processing and study of discourse structure and cognitive approaches to communication between 
individuals in order to conceptualize dialog systems that integrate experience from all areas of 
research just mentioned. 
2 Different Previous Approaches



As there exists a vast amount of literature on discourse and user models, we first give an account
of some of the important directions of research performed up to now. The main interest of all
this work is to make precise the notion of "context", which is said to be of great importance for
natural language understanding. Inspired by the work of Grosz and Sidner ([GroSid86]) on the
interrelations of task and discourse structure, a diversification into structural, semantic and
plan-oriented studies has taken place.



2.1 Discourse Structure 

The fundamental consideration for work on the structure of discourse is that there is a
correspondence between the ordering of utterances in a discourse and how they are related to each
other on the semantic level ([Gar94], [Gar97]). Tree structures are used to describe the semantic
coherence of discourse. By using these structures, one can define constraints on possible places
for attaching a new utterance to an existing discourse ([Web91]) and on accessibility relations for
potential referents for deictic expressions ([Pol95]). In approaches based on Discourse
Representation Theory ([KamRey95]) this correspondence is captured by construction rules (which can be
defined in terms of an extended λ-calculus, see [Kus96]) building up Discourse Representation
Structures that describe the coherence relation of all the included contributions^1.



2.2 Speech Act based Theories 

Many semantic theories work only locally, i.e. they describe the meaning of one single utterance,
ignoring its context and the discourse situation in general. Hence they are unable to account for
the functionality (i.e. intention) of a perceived utterance.

Building on earlier work by Austin, [Sea69] proposed a theory of speech acts that has been
fundamental for research in this area. Speech acts are implemented in dialog managers to derive
hypotheses of how the current utterance contributes to the dialog so far. But considering the
last utterance only is insufficient for a cooperative dialog participant, as this local view does not
pay attention to expectations of other parties involved in the dialog (e.g. when somebody asks a
question, she expects the following utterance to be an answer to it). Consequently, to describe
coherence in dialog steps, the effects of previous utterances have to be recorded somehow. So, the
structural approach of conversational games mentioned briefly above provides an explanation for
the speaker uttering something in the course of dialog. Another point of view on the coherence
problem is taken by [TraAll94]: they propose that the speech acts associated with each utterance
impose social and conventional obligations on the hearer and therefore constrain the set of possible
legal responses. Equivalently, one can state that after uttering something the speaker has certain
expectations that she wants to be fulfilled by any response that will be given in the next dialog
step.



2.3 Intentions, Plans, and Coherence of Utterances 

By now, we are able to describe how contributions to a dialog, linearly ordered by the time of
being uttered, can be integrated into a (partially ordered) discourse structure, but it is still
impossible to explain the motivations of the speaker to use a certain speech act, especially in the
case when the expectations introduced by the previous utterance are violated. To answer this
question, one has to study the mental attitudes of dialog participants.

Motivations for engaging in a dialog can be taken into account by studying the planning of
utterances. Discourse planning is discussed extensively e.g. by Lambert and Lochbaum ([Lam93] and
[Loc94]). Both authors concentrate on the integration of domain dependent planning steps into
the interpretation of utterances. Approaches such as those by [MooYou94] or [CarCar94] devise a
model for collaborative plans for response generation.

1 DRT has been extended by many researchers in different ways, e.g. by Asher for capturing discourse segments
([Ash93]) or in the VERBMOBIL project for handling complex phenomena of spontaneous speech for machine
translation ([ym!35], [flvm83]). But all these theories initially describe monologs and therefore do not consider
multi-party communication, which is characteristic of dialogs.

Another line of research focusses on the BDI model, which is a more domain independent
approach. It accentuates an agent-based view of dialog, as the participants in a dialog and their
personal attitudes are considered to be of main interest. Beliefs, desires, and intentions are
assumed to drive utterances and speech acts ([AshLas97], [AshSin93]). So the key problem for the
interpretation by the dialog manager is the reconstruction of the speaker's attitudes from what
she uttered. Coherence of utterances is obtained by fundamental assumptions on cooperative
behaviour ([Grice]) and by analyzing how the content of utterances coheres on the basis of the
(mutually believed) domain knowledge ([AshLas91], [AshLas94], [AshLasObe92]). Along this
direction of research there is also some work on coherence relations between utterances ([Kno96];
for an overview see [BatRon94]). These relations characterize the logical connection between
utterances and thereby serve as a basic instrument for an analysis of the argumentative structure
described by a given dialog.



2.4 Spoken Language Phenomena 

In spoken language hesitations, repairs, etc. are very common, because in oral communication 
concentration on the topic of the dialog limits mental resources available for speech production. 
These phenomena cannot be captured by semantic formalisms as sketched above. To overcome this 
problem, multi-level processing of spoken language utterances has been proposed in the literature 
(e.g. see [TraHin91]).



3 Interpretation of Utterances 



This section explains how utterances can be interpreted using First Order Partial
Information Ionic Logic (FIL, [Abd95]) as a language for describing the semantics of utterances.

A central issue of dialog management is that dialogs are motivated by the speaker's desire
to add information to the knowledge of the dialog participants. On the other hand, it occurs
frequently that the shared knowledge of the participants does not contain enough information
to meet the expectations that are "pending" between speaker and hearer. For that reason, our
semantic language must be able to handle situations of partial knowledge. FIL provides formulae
(so called ionic formulae) like

$$*(\{\varphi_1, \ldots, \varphi_k\}, \xi)$$

meaning intuitively that $\xi$ is true when it is plausible that $\Phi = \{\varphi_1, \ldots, \varphi_k\}$
(called justification set or justification context) is true, too (see [Abd95], Sect. 5). $\Phi$ is the
set of missing information to infer $\xi$. FIL can be used to compute such justification contexts.

We incorporate FIL for the description of conditions in a DRT-based framework to represent
dialog structure. As an example of such a discourse representation structure (DRS), the utterance

Does a plane depart from Athens to Rome?

has the semantic representation

Universe: t, Rome, Athens

Conditions:
PLANE(t)
DEPART(t)
AIRPORT(Athens)
AIRPORT(Rome)
FROM(t, Athens)
TO(t, Rome)
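A DRS of this kind can be held in a small data structure. The following Python sketch is only an
illustration under our own assumptions: the names `Condition`, `DRS` and `merge` are hypothetical,
conditions are plain atoms rather than FIL formulae, and `merge` is a naive union of universes and
conditions, not one of the DRT construction rules.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Condition:
    """One DRS condition: a predicate applied to discourse referents."""
    predicate: str
    args: tuple


@dataclass
class DRS:
    """A discourse representation structure: a universe plus conditions."""
    universe: set = field(default_factory=set)
    conditions: set = field(default_factory=set)

    def merge(self, other):
        """Naive merge (assumption): unite universes and conditions."""
        return DRS(self.universe | other.universe,
                   self.conditions | other.conditions)


# The DRS for "Does a plane depart from Athens to Rome?"
question = DRS(
    universe={"t", "Rome", "Athens"},
    conditions={
        Condition("PLANE", ("t",)),
        Condition("DEPART", ("t",)),
        Condition("AIRPORT", ("Athens",)),
        Condition("AIRPORT", ("Rome",)),
        Condition("FROM", ("t", "Athens")),
        Condition("TO", ("t", "Rome")),
    },
)
```

Making `Condition` immutable allows conditions to be stored in sets, so that merging two DRSs does
not duplicate identical conditions.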

In DRT, there exists a number of construction rules for the incremental composition of several
individual utterances. One can even infer whether a DRS $K_N$ is a consequence of the DRSs
$K_1, \ldots, K_{N-1}$. But whilst in standard DRT conditions are described by classical first order
formulae, in our approach FIL is used for that purpose. As FIL is a partial logic, we can compute
whether $K_N$ is undefined given $K_1, \ldots, K_{N-1}$. This is possible because FIL allows us to talk
about the truth value of a formula:

$$\mathrm{undefined}(\varphi) \iff {\sim}\neg\varphi \wedge {\sim}\varphi$$

So, we are able to assign one of the following three consequence states to $K_N$:

• $\models K_N$: $K_N$ follows from the discourse so far.

• $\models \neg K_N$: $\neg K_N$ follows from previous utterances.

• $\models {\sim}\neg K_N \wedge {\sim}K_N$: $K_N$ is still undefined.

Using the deduction theorem
($A \cup \{\varphi_1, \ldots, \varphi_k\} \models \psi \iff A \models \varphi_1 \wedge \ldots \wedge \varphi_k \rightarrow \psi$)
we have established a coherence relation between utterances via implication in FIL. When
$\varphi_1 \wedge \ldots \wedge \varphi_k \rightarrow \psi$ is true, $*(\{\varphi_1, \ldots, \varphi_k\}, \psi)$ is true, too.
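The three consequence states and the idea of a justification context can be illustrated with a
deliberately simplified propositional sketch. It is not an implementation of FIL: we assume Horn
rules over string literals, write strong negation as a leading `-`, and take as justification
context the missing premises of the first rule that concludes the goal. All function names are
hypothetical.

```python
def closure(facts, rules):
    """Forward-chain Horn rules (body, head) until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and set(body) <= known:
                known.add(head)
                changed = True
    return known


def negate(literal):
    """Strong negation on string literals: 'p' <-> '-p'."""
    return literal[1:] if literal.startswith("-") else "-" + literal


def consequence_state(goal, facts, rules):
    """Classify the goal into one of three consequence states."""
    known = closure(facts, rules)
    if goal in known:
        return "follows"
    if negate(goal) in known:
        return "negation follows"
    return "undefined"


def justification_context(goal, facts, rules):
    """Missing premises of the first rule concluding the goal, else None."""
    known = closure(facts, rules)
    for body, head in rules:
        if head == goal:
            return {literal for literal in body if literal not in known}
    return None


# Hypothetical flight domain: a connection needs destination, day and time.
RULES = [(("flight", "to", "on", "at"), "connection")]
state = consequence_state("connection", {"flight"}, RULES)
missing = justification_context("connection", {"flight"}, RULES)
```

In this toy setting an undefined goal comes with a set of missing literals, which plays the role of
the justification set $\{\varphi_1, \ldots, \varphi_k\}$ of an ionic formula.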

Interpreting an utterance requires a knowledge base A for the representation of the domain
relevant knowledge. As described in [LudGöNie98], we use description logics to define the notions
to be understood by the dialog manager for a given application. More precisely, description logics
serve for constructing a terminology of the domain, thereby representing domain dependent, but
situation independent knowledge. In order to interpret a specific utterance in a given situation, the
given terminology is instantiated by concrete facts that are entailed by the semantic representation
of the current utterance. To give a simple example of this idea, we could state in the knowledge
base of a flight information system that a flight from a departure location to an arrival location
is a flight characterized by the existence of an airport at the departure and the arrival location,
respectively. In description logics, we could say:

$$\mathrm{FLIGHTFROMTO} = \exists \mathrm{FROM}.\mathrm{AIRPORT} \sqcap \exists \mathrm{TO}.\mathrm{AIRPORT} \sqcap \mathrm{FLIGHT}$$

So, this definition characterizes knowledge that holds in every situation in the given application
domain. On the other hand, the DRS above describes a concrete situation.
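How such a definition is instantiated by concrete facts can be sketched as a membership test over
explicitly listed facts. This is a closed-world lookup written for illustration only, not
description logic reasoning; the dictionaries `concepts` and `roles` and both function names are
assumptions of ours.

```python
def fills(role, individual, concept, roles, concepts):
    """True if some filler of `role` for `individual` belongs to `concept`."""
    return any(
        filler in concepts.get(concept, set())
        for subject, filler in roles.get(role, set())
        if subject == individual
    )


def is_flight_from_to(individual, roles, concepts):
    """Check FLIGHT and the two existential restrictions on FROM and TO."""
    return (
        individual in concepts.get("FLIGHT", set())
        and fills("FROM", individual, "AIRPORT", roles, concepts)
        and fills("TO", individual, "AIRPORT", roles, concepts)
    )


# Facts of one concrete situation (hypothetical individuals f and g).
concepts = {"FLIGHT": {"f", "g"}, "AIRPORT": {"Athens", "Rome"}}
roles = {"FROM": {("f", "Athens"), ("g", "Athens")}, "TO": {("f", "Rome")}}
```

Here `f` satisfies the definition, while `g` does not, because no arrival airport is known for it.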

We conclude that the consequence states mentioned above have to be understood as consequence
on the basis of a situation independent knowledge base that characterizes the application
domain. To interpret utterances in a given situation, the dialog manager tries to infer the
consequence state of the current utterance relying on its domain knowledge. In general, this state
depends on a certain justification context as outlined above. As will be shown below, we can verify
the truth value of all elements in a justification context $\Phi$ if we interpret each $\varphi_i$ as a
question to the hearer and view the subsequent response as an answer to this question. This means
that the dialog manager's planning steps are strongly affected by the results of inference in its
knowledge base. From that view and the semantics of $*(\{\varphi_1, \ldots, \varphi_K\}, \psi)$ we can derive
an n-ary AND/OR tree that reflects the discourse structure of the discussed dialog segment.



In the tradition of [GroSto84] and [GabRey92] we consider (free) discourse referents of
interrogative pronouns as λ-bound variables. If the problem solver finds a solution for the posed query,
then it binds these variables to discourse referents that have been introduced earlier (or during the
process of problem solving). Of course, there can be more than one possible substitution of the
λ-variables. For given variables $x_1, \ldots, x_M$, we denote a substitution of all variables by discourse
referents $t_1, \ldots, t_M$ as $\Sigma = \{[x_1/t_1], \ldots, [x_M/t_M]\}$. $\{\Sigma_1, \ldots, \Sigma_K\}$ is a set of
K pairwise distinct substitutions.

In the general case where the result of the inference process consists of a set $\{\Sigma_1, \ldots, \Sigma_K\}$
of answer substitutions and a set $\{\varphi_1, \ldots, \varphi_{N-1}\}$ of justification substitutions, each
induces an edge in an OR-subtree of the overall discourse structure.
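One possible reading of this construction is sketched below in Python. The layout is our own
assumption and not prescribed by the text: pending justifications are grouped under an AND node
(all of them have to be verified), alternative answer substitutions under an OR node (one of them
suffices), and a leaf counts as satisfied once it is marked as resolved. `Node` and `build_tree`
are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    kind: str = "LEAF"          # "AND", "OR" or "LEAF"
    children: list = field(default_factory=list)
    resolved: bool = False      # only meaningful for leaves

    def satisfied(self):
        """AND needs all children, OR needs one, a leaf must be resolved."""
        if self.kind == "AND":
            return all(child.satisfied() for child in self.children)
        if self.kind == "OR":
            return any(child.satisfied() for child in self.children)
        return self.resolved


def build_tree(query, justifications, substitutions):
    """Attach justifications (AND) and answer substitutions (OR) to a query."""
    pending = Node("justification context", "AND",
                   [Node(j) for j in justifications])
    answers = Node("answer substitutions", "OR",
                   [Node(str(s), resolved=True) for s in substitutions])
    return Node(query, "AND", [pending, answers])


tree = build_tree("alpha", ["To", "On", "At"], [{"x": "AZ717"}, {"x": "OA233"}])
```

As long as one justification is unresolved the query node is not satisfied; resolving the leaves one
by one corresponds to asking the hearer about each element of the justification context.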

This tree structure is part of the describing situation for the current dialog. Below we will
introduce operations on the dialog tree that characterize how the structure is expanded in the
course of dialog. In this sense, such a tree constitutes the "syntax" of the current dialog. But
there is a strong connection to what could be called "dialog semantics". It is grounded basically on
the meaning of the edges in the tree: they express the fact that parent and child nodes are coherent
in the sense of FIL consequence explained above. Furthermore, by the notion of satisfiability of FIL
ionic formulae we exploit the tree structure to reformulate Grosz' and Sidner's relations dominance
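and satisfaction precedence^2: because the justification context $\Phi = \{\varphi_1, \ldots, \varphi_k\}$,
when $*(\Phi, \psi)$ is given, is true if and only if all $\varphi_i$ are true on the justification
level^3, we have $\psi \uparrow \varphi_i$ for all $1 \le i \le k$. And as $\not\models \varphi_i$ for one
$i \in \{1, \ldots, k\}$ implies $\not\models \Phi$, we obtain $\varphi_i \prec \varphi_j$ for $1 \le i < j \le k$.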



4 Basic Elements of Dialogs 



4.1 Empirical Evidence for the Need of User Models 

Dialog managers for real world applications have to be robust in the sense that they always
terminate a (user-)initiated dialog in a controlled way. So, the study of how to build robust
generic dialog managers requires reasoning about what structures exist in a dialog and how they
get modified by utterances. On the other hand, it is also important to understand how dialogs
affect the participants and their future utterances.

A first approach to this problem is to see the function of utterances as that of updating the 
shared knowledge — a data structure maintained and used by all dialog participants. From this 
point of view, each dialog participant infers the same consequences from every new utterance. 

But as pointed out in the AI literature, people actually hold personal assumptions about the
meaning of an utterance. These assumptions can differ among dialog participants. We illustrate
this by an example taken from the TRAINS corpus (see [TrainsCorpus]):

1.1 M: okay
1.2  : we have to get .. a
1.3  : tanker car of orange juice to uh Avon
1.4  : and a
1.5  : boxcar of bananas to Corning
1.6  : and we have to do that by 3 PM today
2.1 S: okay
3.1 M: okay
3.2  : so let's see umm
3.3  : ... we probably have to take the tanker car
3.4  : from Corning to Elmira
3.5  : to get uh
3.6  : orange juice in it
3.7  : um
       [2sec]
3.8  : [click] and uh
3.9  : how far is it from Corning to Elmira
3.10 : how long would it take
4.1 S: 2 hours
5.1 M: m hm
5.2  : okay so
       [2sec]
5.3  : why don't we uh
       let's see
       [sniff]
       okay why don't we
       would w/
       uh
       why don't we consider sending uh
       engine E2
       to Corning
       to get the tanker car
       and uh
       bring it back to Elmira
       uh
       and uh have them
       have them fill it
       with OJ
       so how long would it take
6.1 S: well y/ you need to get oranges
6.2  : to the OJ factory



In (1.1) to (1.6) M describes the goal of this dialog and some constraints. In doing so, he
makes some of his mental attitudes public, thereby assuming that S will be able to interpret
them appropriately. So (1.1) to (1.6) do not transport content about the domain, but about M
exclusively. The impact of the observation that utterances can contain domain relevant knowledge
as well as knowledge about other dialog participants is enormous: in (6.1) and (6.2) S tries to
explain why M will not be able to reach his goals by explaining why M's information about the
domain and the current domain scenario is incomplete or false. This is possible only because S can
differentiate between his own domain knowledge and that transported by the previous utterances.
As a consequence, S's cooperative behaviour is made possible by his ability to reason about the
domain and about his assumptions of M's view of the domain.

2 $\alpha$ dominates $\beta$ ($\alpha \uparrow \beta$) if and only if $\beta$ is part of $\alpha$, while $\alpha$
satisfaction-precedes $\beta$ ($\alpha \prec \beta$) if and only if $\alpha$ is necessary for $\beta$.

3 I.e. (stated in model-theoretic semantics) there exists an interpretation that expands any interpretation that
makes $\psi$ true in such a way that it assigns true to all $\varphi_i$, too.

This example shows that cooperative dialog managers must maintain some sort of user model. 
Our approach will be discussed in the remainder of this section. 



4.2 Rational Behaviour of Dialog Participants 

Discussing dialog management, one normally assumes that people initiate communication with
others in order to get help for achieving a certain goal. From this observation, we can derive
that questions are asked to get an answer that completes the speaker's knowledge in some way.

For an answer to be helpful, it has to meet certain constraints (expressed by [Grice] in his
maxims of cooperation): first, it has to be coherent with the question so that it can deliver
valuable information. And, of course, it should be true. These requirements pose constraints on
the behaviour of the person giving the response, too. This person has to be cooperative, i.e.
she should adopt the speaker's goals as far as she can realize them. Honesty is another crucial
point, for a person who does not feel obliged to tell the truth is not a reliable source of information.

Our discussion is restricted to dialogs that fulfill the requirements above, i.e. we assume some
amount of rational behaviour for all dialog participants.

To reason about goals and intentions, one has to study the cognitive structures that underlie
rational behaviour. For dialogs, these structures are described at length in [TR663].

One major challenge for designing robust dialog systems seems to be how to reconstruct the
contents of the mental states of the user out of the utterances, the only observable facts. So
the study of how language can transfer attitudes and reflect planning steps of dialog participants
becomes very important. [Eng88] notes that in German basically one can communicate three
different types of utterances:

• constative: state facts. Our requirement of honesty admits the conclusion that nothing 
wrong is intentionally stated to be true. 

• interrogative: ask something. 

• imperative: request something. 



4.3 Discourse Domain and Application Domain 

To formalize the intuitions described up to now, we need a framework that enables us to talk
about utterances or the corresponding DRSs, respectively.

We can achieve this by introducing a discourse domain^4, which is the domain of describing
situations. Its atomic elements are DRSs, whereas in the application domain (consequently to be
defined as the domain of described situations) atomic elements are objects of the application. This
reflects the ability of natural language to "climb up" to a meta-level, e.g. simply by saying "What
you have told me up to now has been clear to me. But I can't understand what I should do now."
If one thinks about a situation in which a teacher instructs a pupil how to achieve some goal, then
it becomes obvious that responses as above refer only to the instructions, but not to what is
currently being instructed. We use this framework to reason about the relation of utterances and
speaker's attitudes. Following the notation in [AshLas97], we express the connection between
formulating attitudes and the attitudes themselves in the following way:

$$\mathrm{constative}(K) \rightarrow W_I(B_R(K)) \qquad (1)$$
$$\mathrm{interrogative}(K) \rightarrow W_I(\mathcal{K}_I(K)) \qquad (2)$$
$$\mathrm{imperative}(K) \rightarrow W_I(\mathrm{do}_R(K)) \qquad (3)$$

4 The discourse domain is not the domain of discourse, as we try to make clear in the remainder of this section.
We call the domain of discourse the application domain to express the fact that discourse structure and application
structure (sometimes called task structure) are different and not isomorphic.

To describe defeasible rules, we exploit FIL ionic formulae: defaults that may eventually be
defeated by overriding exceptions are expressed as ionic formulae. Thereby the consequent of an
implication is assumed to be valid as long as there is no evidence to the contrary. If we have
$\varphi \rightarrow *(\psi, \psi)$ and $\varphi$, then $\psi$ is considered true, unless there is evidence for
$\psi$ to be false. So we can formulate

• a principle of sincerity:

$$\mathcal{I}_I(B_R(K)) \rightarrow *(B_I(K), B_I(K)) \qquad (4)$$

• principles of cooperation:

$$W_I(K) \wedge B_I(K) \rightarrow *(W_R(K) \wedge B_R(K),\; W_R(K) \wedge B_R(K)) \qquad (5)$$
$$W_I(\mathrm{do}_R(K)) \rightarrow *(W_R(\mathcal{K}_I(K)),\; W_R(\mathcal{K}_I(K))) \qquad (6)$$

After having defined how linguistically motivated types of utterances and mental states are
interrelated and how fundamental principles of collaboration in dialogs are
expressed formally, we show (by means of two examples) how mental states and speech acts are
connected via defeasible rules in FIL:

$$W_I(\mathcal{K}_I(K)) \rightarrow *(\mathrm{query}_I(K), \mathrm{query}_I(K)) \qquad (7)$$
$$B_R(\neg({\sim}(\xi \rightarrow K) \wedge {\sim}\neg(\xi \rightarrow K))) \wedge W_R(\mathcal{K}_I(K)) \wedge W_R(B_I(\xi)) \rightarrow *(\mathrm{inform}_R(\xi), \mathrm{inform}_R(\xi)) \qquad (8)$$

Informally, the first rule states that anything one wants to know is normally asked for, while the
second rule claims that if one wants another dialog participant to know something and believes it
could be a consequence of something else and wants the other to know that, too, then one informs
about that new fact.
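The way such defaults behave can be sketched with a naive closure procedure in Python. This is a
strong simplification and not FIL: formulae are opaque strings, evidence to the contrary is an
explicit pair `("not", conclusion)`, and a default fires whenever its premises hold and no such
evidence is present. The function name and the string encoding are assumptions of ours.

```python
def apply_defaults(facts, defaults):
    """Add default conclusions whose premises hold and that are not contradicted."""
    beliefs = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in defaults:
            blocked = ("not", conclusion) in beliefs
            if (set(premises) <= beliefs and not blocked
                    and conclusion not in beliefs):
                beliefs.add(conclusion)
                changed = True
    return beliefs


# Two competing hypotheses for the speech act of an utterance a.
defaults = [
    (("W_I(K_I(a))",), "query_I(a)"),    # in the spirit of rule (7)
    (("W_I(K_R(a))",), "inform_I(a)"),   # an inform hypothesis
]
facts = {"W_I(K_I(a))", "W_I(K_R(a))", ("not", "inform_I(a)")}
beliefs = apply_defaults(facts, defaults)
```

With the explicit evidence against `inform_I(a)` only the query hypothesis is added, which mirrors
how a default conclusion is withdrawn as soon as there is evidence to the contrary.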

We can draw the following conclusion: after the speaker (I) has asked K and gets $\xi$ as response,
she can defeasibly infer $\mathrm{inform}_R(\xi)$, assuming that $\xi$ is constative and relying on
cooperation, if there is no evidence for R (the hearer of I's utterance, but now the speaker of $\xi$)
against the belief $\xi \rightarrow K$. Furthermore, by sincerity I can also infer that R intends I to
believe $\xi \rightarrow K$.

This example outlines our approach to reconstructing speech acts from observable utterance
types via reasoning about the speaker's intentions. Recognizing speech acts is important for a
cooperative dialog manager, as it helps to explore what state the dialog currently is in (e.g. a
question has been asked and an appropriate answer is now expected). It seems to be just this
notion of dialog state that captures the interactive nature of dialogs in contrast to monological
discourse like a newspaper article. Describing what the transition of the dialog state looks like in
the course of utterances therefore extends the notion of coherence in discourse sketched above.
But, as can be seen easily from the inform rule (8) above, coherence is the link between mental
states and speech acts: no speech act can be recognized without reasoning about coherence of
utterances. In the rule above, $\mathrm{inform}_R(\xi)$ can be inferred only if $\models \xi \rightarrow K$ or
$\not\models \xi \rightarrow K$. The case when $\xi \rightarrow K$ is undefined will be discussed later.

5 We denote the speaker I (initiator) and the hearer R (responder). B expresses beliefs, W desires, $\mathcal{I}$ intentions,
and $\mathcal{K}$ knowledge.
4.4 Dialog States and Dialog Structure

First we turn to the discussion of dialog states and their transition relation: as noted in the
literature (see e.g. [TraAll94], [CriWeb97]), utterances in dialogs induce expectations of what a
plausible response should look like. Stated differently, by her speech act, the speaker imposes some
obligation on the hearer that constrains the preferred responses. Nevertheless, a model of dialog
that does not restrict the range of "valid" utterances too much should give an account of a
cooperative reaction even if expectations have been violated in a response.

Meeting an expectation and thereby completing the conversational game opened by the initiator,
the responder implicitly performs a segmentation of the discourse. In what follows, we describe 
the syntax of discourse segments using a context free grammar whose rules are attributed by 
operations on the dialog structure depending on the coherence between the current utterance and 
the knowledge shared between the dialog participants. In that way, we define the semantics of 
discourse segmentation in terms of coherence of utterances using the notion of the AND/OR tree 
introduced above. 

4.4.1 Simple Segments 

In a simple segment the speaker poses no obligations on the hearer. When no obligations are
pending, an inform act would create a simple segment according to the following rule:

$$\mathrm{SEG} \rightarrow \mathrm{inform}_I(K) \qquad (9)$$
$$\mathrm{SEG} = \mathrm{newAND}(\mathrm{SEG}, K)$$

This rule states that K can be added as a new fact to the shared knowledge, and therefore be inserted
as an AND branch into the currently centered segment. In the absence of pending expectations,
the root segment (i.e. the root of the dialog tree) is centered by default^6.

4.4.2 Complex Segments 

Dialog segments are called complex when the speaker assigns expectations to her utterance. For
example, when asking a question she expects an answer to be given by the hearer that helps
realize her intentions. But it is obvious that the hearer could have reason not to fulfill the
expectations (at least temporarily): she could need additional information first in order to answer
the question and therefore respond with a new one. After this question has been answered,
the hearer will respond according to the still pending expectation. We capture this situation with
the following rules:

$$\mathrm{SEG} \rightarrow \mathrm{query}_I(K)\ \mathrm{QUERYEXP} \qquad (10)$$
$$\mathrm{SEG}_1 = \mathrm{newAND}(\mathrm{SEG}_1, \mathrm{newSubTree}(K, \mathrm{QUERYEXP}))$$

$$\mathrm{SEG} \rightarrow \mathrm{SEG}\ \mathrm{SEG} \qquad (11)$$
$$\mathrm{SEG}_1 = \mathrm{newAND}(\mathrm{SEG}_2, \mathrm{SEG}_3)$$

Here, the first rule says that a segment can be initiated by a query imposing an obligation to
answer it. On the other hand, as expressed in the second rule, a segment can consist of two
sub-segments forming an AND subtree. QUERYEXP follows the rules:

$$\mathrm{QUERYEXP} \rightarrow \mathrm{ANSWER} \qquad (12)$$
$$\mathrm{QUERYEXP} = \mathrm{ANSWER}$$

$$\mathrm{QUERYEXP} \rightarrow \mathrm{ANSWER}\ \mathrm{QUERYEXP} \qquad (13)$$
$$\mathrm{QUERYEXP}_1 = \mathrm{newOR}(\mathrm{ANSWER}_2, \mathrm{QUERYEXP}_3)$$

6 Below, after having introduced the operations on the dialog tree, we will give an example illustrating the
application of the segmentation rules.

ANSWER is defined as follows:

$$\mathrm{ANSWER} \rightarrow \mathrm{inform}_I(K) \qquad (14)$$
$$\mathrm{ANSWER} = \mathrm{newNode}(K)$$

$$\mathrm{ANSWER} \rightarrow \mathrm{SEG}\ \mathrm{inform}_I(K) \qquad (15)$$
$$\mathrm{ANSWER} = \mathrm{changeRoot}(\mathrm{SEG}, K)$$
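The tree operations named in the rule attributes are not defined in this section, so the following
Python sketch fixes one hypothetical semantics for them, chosen such that a clarification dialog
with three embedded question-answer pairs yields a tree with the answer as root of the embedded
segment. The class `Tree`, the helper `_splice`, the flattening of nested structural nodes, and all
labels are assumptions of ours.

```python
from dataclasses import dataclass, field


@dataclass
class Tree:
    label: str = ""             # "" marks a purely structural node
    kind: str = "AND"
    children: list = field(default_factory=list)


def _splice(tree, kind):
    """Flatten an unlabeled node of the same kind into its children."""
    if tree.label == "" and tree.kind == kind:
        return list(tree.children)
    return [tree]


def new_node(label):
    return Tree(label)


def new_sub_tree(label, child):
    """A node for `label` with `child` as its only subtree."""
    return Tree(label, "AND", [child])


def new_and(left, right):
    return Tree("", "AND", _splice(left, "AND") + _splice(right, "AND"))


def new_or(left, right):
    return Tree("", "OR", _splice(left, "OR") + _splice(right, "OR"))


def change_root(segment, label):
    """Make `label` the root of the material collected in `segment`."""
    return Tree(label, segment.kind, list(segment.children))


# Three embedded question-answer pairs, combined as in rules (10), (11), (14).
pairs = [new_sub_tree(f"query_R(b{i})", new_node(f"inform_I(g{i})"))
         for i in (1, 2, 3)]
embedded = new_and(new_and(pairs[0], pairs[1]), pairs[2])
# Rule (15): the final answer becomes the root of the embedded segment.
answer = change_root(embedded, "inform_R(r)")
dialog = new_sub_tree("query_I(a)", answer)
```

Flattening nested unlabeled AND nodes is a design choice of this sketch: it turns the binary
branching of rule (11) into one n-ary AND node below the answer.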



4.5 Example 



In this section, we give an example of how our dialog model works. For the purpose of
illustration we use the following short flight information dialog:

$\alpha$ User: I want a flight from Athens.
$\beta_1$ System: Where do you want to go to?
$\gamma_1$ User: To Rome.
$\beta_2$ System: On which day?
$\gamma_2$ User: On Monday.
$\beta_3$ System: When do you want to depart?
$\gamma_3$ User: At 12 o'clock.
$\rho$ System: There are several connections: AZ717 at 06:55, OA233 at 09:05, AZ725 at 09:20,
OA235 at 12:55, AZ721 at 16:00, OA239 at 17:10, AZ723 at 20:20.



As will be discussed in Sect. 7.2, "I want" expresses a desire explicitly and is therefore
represented as $W_I(\alpha)$. From that we conclude $*(\mathrm{query}_I(\alpha), \mathrm{query}_I(\alpha))$. Besides
that, we have the implications $W_R(\mathcal{K}_I(\alpha))$ by cooperation and $\neg\mathrm{inform}_I(\alpha)$. On the
other hand, grammatically $\alpha$ is constative. Therefore
$\mathrm{constative}(\alpha) \rightarrow W_I(\mathcal{K}_R(\alpha))$ and, consequently,
$*(\mathrm{inform}_I(\alpha), \mathrm{inform}_I(\alpha))$. But the justification context $\mathrm{inform}_I(\alpha)$ is in
conflict with $\neg\mathrm{inform}_I(\alpha)$. Therefore, the only derivable hypothesis for $\alpha$'s speech act
is $\mathrm{query}_I(\alpha)$. According to the rules for discourse segmentation above, this induces a
QUERYEXP for R (i.e. the dialog manager). This obligation is expressed in the user model by
$W_R(\mathcal{K}_I(\alpha))$.

In order to fulfill this desire, the dialog manager tries to evaluate the consequence status of
$\alpha$ (see Sect. 3). This evaluation is performed on the basis of the domain model (for a more
elaborate example see [LudGöNie98]). Its result is that the status of $\alpha$ depends on the
justification context {At, To, On}, i.e. departure day and time as well as the destination are still unknown.

So, mental attitudes of the dialog manager change: a new desire $W_R(\mathcal{K}_R(\beta_1))$ is added that
implies $*(\mathrm{query}_R(\beta_1), \mathrm{query}_R(\beta_1))$, i.e. a new discourse obligation is being introduced,
leaving the old one pending.

For the user's response $\mathrm{constative}(\gamma_1)$ holds, implying $W_I(\mathcal{K}_R(\gamma_1))$. From
$\gamma_1$ the dialog manager infers $\gamma_1 \rightarrow \beta_1$. Consequently, it assumes
$B_I(\gamma_1 \rightarrow \beta_1)$ and $*(\mathrm{inform}_I(\gamma_1), \mathrm{inform}_I(\gamma_1))$.
As there is no evidence against $\mathrm{inform}_I(\gamma_1)$ and no other hypothesis for a possible speech
act, $\mathrm{inform}_I(\gamma_1)$ is assumed. This completes $\mathrm{query}_R(\beta_1)$. With this completion, the
dialog manager obtains information about the destination.

Processing of $\beta_2$, $\gamma_2$ and $\beta_3$, $\gamma_3$ is performed analogously. After that, the justification
context inferred during the interpretation of $\alpha$ has been answered completely. Having obtained all
this information, the dialog manager can compute a solution for $\alpha$ that is uttered in $\rho$.

Fig. 1 shows the dialog segmentation for this example, while Fig. 2 sketches the coherence
structure of all utterances, represented in a DRS in Fig. ||. A point worth mentioning is that the
structure of the justification context as an AND tree is embedded in the coherence structure of
the dialog.
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SEG 



query J (a) 



QUERYEXP 
I 

ANSWER 



SEG inform/? (p) 



SEG 



QUERYEXP query i? (/3 1 ) 
I 

ANSWER 




SEG 



SEG 



inform/ (71 



QUERYEXP query H (/? 2 ) QUERYEXP query R (/3 3 ) 

I I 

ANSWER ANSWER 

I I 

inform/^) inform/^) 

Figure 1: Segmentation of the Example Dialog 

                        query_I(α)
                            |
                        inform_R(ρ)
               /            |            \
      query_R(β1)      query_R(β2)      query_R(β3)
           |                |                |
      inform_I(γ1)     inform_I(γ2)     inform_I(γ3)

Figure 2: Coherence structure of the Example Dialog

5 Incoherence and its Effects on Dialog Structure 

This section explains in more detail how our dialog manager handles situations in which two
utterances are incoherent in the sense mentioned earlier although they would not otherwise violate
any expectations. Consider the dialog in Fig. 4 as a motivation of the problem:

This dialog is quite regular according to the rules on dialog state described in the previous
section until U utters $\theta$. Shared domain knowledge $\Delta$ and the facts collected during the
dialog do not allow us to conclude that $\theta$ and $\eta$ are coherent. One can derive that
$\Delta \cup \{\theta\} \models {\sim}\neg\eta \wedge {\sim}\eta$. So the question arises which
speech act to assign to $\theta$. To answer that question we have to go back some steps in the
dialog: when uttering $\gamma$, S made U assume $W_S(\mathcal{K}_S(\gamma))$, implying
$W_U(\mathcal{K}_S(\gamma))$. From this observation and from $\Delta \cup \{\theta\} \models \gamma$,
S can infer that U wants to give an answer to $\gamma$, thereby ignoring S's expectation that
$\eta$ will be answered by U.



How can such a situation be described in our model of dialog states? Following [Tra94], we
assume that several speech acts can be assigned to one utterance, allowing the speaker to express
multiple intentions at one time. Besides the inform act derived by inferring coherence between
$\theta$ and $\gamma$, we assign a cancel act to U's last utterance. cancel means that a pending
dialog obligation is violated intentionally, as is the case when U referred to $\gamma$ when
responding to $\eta$.

Discourse referents: α, β1, β2, β3, γ1, γ2, γ3, ρ

    α  = [ f, Athens | Flight(f), From(f, Athens) ]
    want(u, α)

    β1 = λl. [ u | Go(u), Location(l), To(u, l) ]
    γ1 = β1(Rome)

    β2 = λd. [ u | Go(u), DAY(d), On(u, d) ]
    γ2 = β2(Monday)

    β3 = λt. [ u | Go(u), TIME(t), At(u, t) ]
    γ3 = β3(12 o'clock)

    ρ  = [ c, AZ717, OA233, AZ725, OA235, AZ721, OA239, AZ723 |
           Several(c), Connection(c),
           Flight(AZ717), Flight(OA233), Flight(AZ725), Flight(OA235),
           Flight(AZ721), Flight(OA239), Flight(AZ723),
           At(AZ717, 06:55), At(OA233, 09:05), At(AZ725, 09:20), At(OA235, 12:55),
           At(AZ721, 16:00), At(OA239, 17:10), At(AZ723, 20:20) ]

Figure 3: Discourse Representation Structure for the Example Dialog



cancel is defined by

    $({\sim}\neg K \wedge {\sim}K) \wedge \mathrm{inform}_R(K) \rightarrow \mathrm{cancel}_R(K)$

To integrate cancel into speech act processing, we add a new rule for ANSWER:

    ANSWER =   ANSWER
                  |
             cancel_I(K)



U: Is there a flight to Rome on Saturday? 

S: Yes. LH745 at 10:38, AZ304 at 15:03, or 2G261 at 16:25. 

S: Which airline do you prefer? 

U: The Alitalia flight would be quite convenient. 

U: Do they offer business class? 

S: Yes, they do. 

S: Have you got a MilleMiglia card? 

U: [Hmm, actually] I rather prefer Lufthansa due to their superb service. 



Figure 4: Example Dialog

6 Configuration of Dialog Managers



Our approach to dialog understanding, as characterized so far, is guided by the idea of
separating domain independent algorithms and data structures from domain dependent data for
specific applications, and of separating the discourse domain proper from the application domain.
By isolating the domain independent dialog elements we try to determine the minimal amount of
dialog structure that has to be configured for a specific application. In particular, we have
distinguished three models contributing to the setup of a dialog system for a given application:

• Domain model 

It defines the notions that exist in the application domain and how these notions are being 
interpreted. Additionally, it describes how the vocabulary of the domain is connected with 
notions defined in the domain model (see Sect. 

• Dialog model 

The conversational games valid for the application (see Sect. and the rules how moves of 
different games can be interleaved among each other are defined in the dialog model. Rules 
for the games specify how the dialog structure (represented by the dialog manager as an 
AND/OR tree) is affected by a certain game. In addition, it is possible to restrict dialog 
participants to different sets of conversational games that they are allowed to begin. E.g. in 
a dialog model without mixed initiative the user would not be permitted to begin a query 
game, but only be allowed to react with inform. It is an open question whether a restriction 
of the kind just described could serve as a sufficient characterization of the complexity of 
dialogs. 

• Model of dialog participants (user model) 

In order to reason about motivations for conversational games one has to connect game
moves (i.e. speech acts) with mental attitudes of dialog participants. For this purpose, the
user model defines necessary conditions on mental attitudes for each speech act. In addition,
it contains the general principles of rational behaviour that hold between the attitudes of
dialog participants (see Sect. 4.2).
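
To make the division into three models more concrete, the following Python sketch shows one possible way to hold such a configuration. All class names, field names and example entries are hypothetical and serve only as an illustration. In particular, the sketch assumes that the restriction on who may begin which conversational game can be stored as a simple mapping from participants to sets of games.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DialogModel:
    # participant -> games that participant is allowed to begin
    may_begin: dict

    def allowed(self, participant, game):
        return game in self.may_begin.get(participant, frozenset())


@dataclass(frozen=True)
class Configuration:
    domain_model: dict         # application specific notions and vocabulary
    dialog_model: DialogModel  # conversational games and who may begin them
    user_model: dict           # speech act -> necessary mental attitudes


# Without mixed initiative the user may only react with inform.
system_initiative = DialogModel({
    "user": frozenset({"inform"}),
    "system": frozenset({"query", "inform"}),
})
mixed_initiative = DialogModel({
    "user": frozenset({"query", "inform"}),
    "system": frozenset({"query", "inform"}),
})

config = Configuration(
    domain_model={"notions": ["Flight", "Location", "DAY", "TIME"]},
    dialog_model=system_initiative,
    user_model={"constative": "W_I(K_R(K))"},  # placeholder entry for illustration
)

print(config.dialog_model.allowed("user", "query"))  # False
print(mixed_initiative.allowed("user", "query"))     # True
```

In this reading, moving to a new application would mainly mean exchanging `domain_model`, while the other two entries could largely be reused.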



During the configuration for a specific application the models sketched above have to be defined. 
We argue that at least for the dialog model and the user model there exists a large application 
independent subset of definitions that holds for any application and has only to be completed for 
a concrete domain. 

In many cases, such a subset would reduce configuration to the definition of an appropriate
domain model. From this point of view, it would be worth analyzing which classes of dialogs could
be covered by proposals for domain independent sets of speech acts (such as the one described by
the Discourse Resource Initiative, see [DRI97]).



7 Explicit Modification and Segmentation of Dialog Structures

Any model for describing discourse, such as the one sketched in this paper, should give an account
not only of the implicit construction of dialog structures, but also of their explicit modification
by the dialog participants (see e.g. [Coh90]). Such an account would reflect the capability of
dialog to talk about what [Bun97] has called "dialog control", or about attitudes and mental states
of dialog participants. Switching to the "meta-level" is normally signalled by cue phrases or
performatives.

In the remainder of this section we will discuss how these special types of utterances that have
a well-defined meaning only in the describing situation of the dialog affect our dialog model.

7.1 Cue Phrases



In our opinion, many natural language expressions are like polymorphic operators in
object-oriented programming languages: they take arguments of different types and have different
semantics each time. This view is shared by other researchers, too (e.g. [Ben88]). For example, in
the utterance "Do you want to depart from Munich or from Frankfurt?", "or" expresses a choice
between two locations, i.e. two objects of the described situation. On the other hand, in "Will you
go there by bus or rather take the car?", "or" again states two possible alternatives, but in this
case they are utterances, i.e. objects of the describing situation.

To incorporate the interpretation of cue phrases into the dialog model we rely on Knott's work
([Kno96]) on coherence relations. Knott discusses extensively how cue phrases contribute to the
understanding of discourse coherence: he assumes that any cue phrase has the function of an
operator between previous utterances $\alpha_1, \ldots, \alpha_n$ and an utterance $\beta$ following
the cue phrase. These utterances are connected by a defeasible rule
$P_1 \wedge \ldots \wedge P_N \rightarrow C$, which we can express in FIL as
$*(P_1 \wedge \ldots \wedge P_N \rightarrow C,\; P_1 \wedge \ldots \wedge P_N \rightarrow C)$. Each
cue phrase has an associated set of features, such as the polarity of the consequent, that define
its semantics and how the utterances in the scope of the cue phrase are linked to $P_i$ and $C$.
E.g. for "but", we have $P_i := \neg\alpha_i$ and $C := \beta$. Consequently, for "$\alpha$, but
$\beta$", coherence between $\alpha$ and $\beta$ is expressed by $\neg\alpha \rightarrow \beta$.

For the general case, coherence between utterances in the scope of the cue phrase can be
established if $\{P_1, \ldots, P_N\} \models C$ or $\{P_1, \ldots, P_N\} \mathrel{|\!\approx} C$.
This result can be exploited to update the dialog structure appropriately.
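
As a rough illustration of how a cue phrase can be read as an operator that produces premises and a consequent, consider the following Python sketch. It rests on strong simplifying assumptions: formulas are nested tuples over propositional atoms, entailment is checked classically by enumerating truth assignments rather than in the partial logic FIL, the feature table contains only the entry for "but" given above, and the background rule used in the example stands in for domain knowledge. All function names are hypothetical.

```python
from itertools import product

# Feature table for cue phrases; only "but" is taken from the text.
CUE_FEATURES = {"but": {"negate_premises": True}}


def rule_for(cue, alphas, beta):
    """Return (premises, consequent) for the utterances in the scope of `cue`."""
    feats = CUE_FEATURES[cue]
    premises = [("not", a) if feats["negate_premises"] else a for a in alphas]
    return premises, beta


def holds(f, v):
    """Evaluate formula `f` under the truth assignment `v` (classical semantics)."""
    if isinstance(f, str):
        return v[f]
    if f[0] == "not":
        return not holds(f[1], v)
    if f[0] == "imp":
        return (not holds(f[1], v)) or holds(f[2], v)
    raise ValueError(f"unknown connective: {f[0]}")


def atoms(f):
    """Collect the atom names occurring in formula `f`."""
    return {f} if isinstance(f, str) else set().union(*(atoms(g) for g in f[1:]))


def entails(premises, conclusion):
    """Classical entailment by exhaustive enumeration of truth assignments."""
    names = sorted(set().union(atoms(conclusion), *(atoms(p) for p in premises)))
    for values in product([False, True], repeat=len(names)):
        v = dict(zip(names, values))
        if all(holds(p, v) for p in premises) and not holds(conclusion, v):
            return False
    return True


premises, consequent = rule_for("but", ["a"], "b")
background = [("imp", ("not", "a"), "b")]  # assumed domain knowledge

print(premises, consequent)                       # [('not', 'a')] b
print(entails(premises + background, consequent)) # True
print(entails(premises, consequent))              # False
```

The last two lines show that, in this toy setting, the premises yield the consequent only together with suitable background knowledge.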

For a deeper investigation of this topic let's have a look at the following dialog: 



    ?α  U: Is there a flight to Rome on Saturday?
    .β  S: Yes. AZ631 at 15:03
    ?γ  U: How much is it?
    .δ  S: DM 528 plus tax.
    ?ε  U: or on Monday, what about that?
    .ζ  S: On Monday you can fly with Debonair
    ?η  U: How much is a ticket?
    .θ  S: DM 199 plus tax.



How is "or on Monday?" processed and integrated into the dialog structure? Firstly, it has to 
be noticed that ellipsis resolution on the PP "on Monday" yields as syntactic referent "Is there a 
flight to Rome?". 

Secondly, the DRS for the describing situation after $\delta$ has been processed is as shown in Fig. 5.



Discourse referents: α, β, γ, δ

    α = λf. [ s, r, f | flight(f), Rome(r), To(f, r), Saturday(s), on(f, s) ]
    β = α(az631)
    γ = λp. [ az631, p | COST(az631, p) ]
    δ = γ(528+Tax)

Figure 5: DRS after δ has been processed.



Combining the remark on the ellipsis "on Monday" and the DRS for the dialog so far, we find
that the right-hand argument of the operator "or" is $\alpha$[Saturday/Monday] (i.e. in $\alpha$ all
occurrences of Saturday are substituted by Monday). So, we can write down a DRS (see Fig. 6) that
describes the semantics of $\varepsilon$.



Discourse referents: X, α[Saturday/Monday]

    α = λf. [ s, r, f | flight(f), ROME(r), TO(f, r), Saturday(s), on(f, s) ]
    X = ?
    X or α[Saturday/Monday]

Figure 6: DRS for ε

As can be seen from the DRS for $\varepsilon$, the problem of finding an appropriate discourse
referent for X can be solved by anaphora resolution in the describing situation. The only
antecedent in the describing situation compatible with $\alpha$[Saturday/Monday] is $\alpha$. So,
$\alpha$ has been centered by $\varepsilon$.
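
The substitution and the search for a compatible antecedent can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a DRS is reduced to a tuple of referents and a tuple of conditions, and "compatible" is taken to mean that a candidate becomes identical to the right-hand argument once the same substitution is applied to it. The paper does not spell out the compatibility test, so this criterion and the names `DRS`, `substitute` and `resolve` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DRS:
    referents: tuple
    conditions: tuple  # each condition is (predicate, arg1, ...)


def substitute(drs, old, new):
    """Replace every occurrence of predicate `old` by `new`."""
    return DRS(
        drs.referents,
        tuple((new if c[0] == old else c[0],) + c[1:] for c in drs.conditions),
    )


def resolve(rhs, candidates, old, new):
    """Names of candidate antecedents that match `rhs` after substitution."""
    return [name for name, d in candidates.items() if substitute(d, old, new) == rhs]


alpha = DRS(
    ("s", "r", "f"),
    (("flight", "f"), ("Rome", "r"), ("To", "f", "r"), ("Saturday", "s"), ("on", "f", "s")),
)
gamma = DRS(("az631", "p"), (("COST", "az631", "p"),))

rhs = substitute(alpha, "Saturday", "Monday")  # alpha[Saturday/Monday]
print(resolve(rhs, {"alpha": alpha, "gamma": gamma}, "Saturday", "Monday"))  # ['alpha']
```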

The impact of $\varepsilon$ on the dialog structure is determined essentially by how "or" modifies
the dialog segmentation: the segment of $\alpha$ is substituted by an OR subtree representing the
two arguments of "or" (see Fig. 7). For the utterances to follow, $\varepsilon$ is the center, and
dialog processing works as usual.
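
The replacement of a segment by an OR subtree can be illustrated with a small Python sketch. The node layout used here is an assumption made for the example and is not meant to reproduce the tree in the figure; the names `Node` and `apply_or` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    kind: str = "LEAF"  # "AND", "OR" or "LEAF"
    children: list = field(default_factory=list)


def apply_or(root, segment_label, alternative_label):
    """Replace the segment `segment_label` by an OR node over the alternative and the old segment."""
    for i, child in enumerate(root.children):
        if child.label == segment_label:
            root.children[i] = Node("or", "OR", [Node(alternative_label), child])
            return True
        if apply_or(child, segment_label, alternative_label):
            return True
    return False


# Assumed dialog tree before "or on Monday" is processed.
root = Node("ROOT", "AND", [
    Node("alpha"),
    Node("gamma", "AND", [Node("delta")]),
])

apply_or(root, "alpha", "alpha[Saturday/Monday]")
print(root.children[0].kind)                         # OR
print([c.label for c in root.children[0].children])  # ['alpha[Saturday/Monday]', 'alpha']
```

The old segment is kept as one branch of the OR node, so nothing that was already established for it is lost.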

7.2 Performatives and Modal Verbs 

Performatives and modal verbs state assertions not about the described, but the describing situ- 
ation; more precisely, they express assertions about mental states and speech acts as in "I must 
leave you now" , "On what day do you want to depart?" , "I suggest not to pay at all for this bad 
film." 

As such utterances do not talk about the described situation, they cannot be processed as if they 
did. Consequently, speech act recognition does not apply as normally in this case. For that reason, 
all rules above for inferring speech acts are defeasible. Therefore we can devise "special" rules for 
performatives and modal verbs that "override" defeasible inferences based solely on syntactic and 
prosodic criteria if an inspection of the content of the utterance provides evidence against these 
"default conclusions" . To do this, we classify performatives and modal verbs according to what 
mental state they operate on or what speech act they express. 

[Tree diagram: root ROOT; below it the nodes ε, ∧ and γ; below those the nodes
α[Saturday/Monday], ∨, α and δ.]

Figure 7: Dialog Tree after Applying α or α[Saturday/Monday]

In the utterance "I want to fly to Rome on Saturday.", "want" implicitly expresses the query "Is
there a flight to Rome on Saturday?". It is immediately clear that the utterance is not actually
an inform act, which would place no obligation on the hearer to respond cooperatively. So desires
and queries are defeasibly connected by:

    $W_I(\mathrm{do}_I(K)) \rightarrow *(\mathrm{query}_I(K), \mathrm{query}_I(K))$    (19)

$W_I(\mathrm{do}_I(K))$ overrides inform in the following way:

    $W_I(\mathrm{do}_I(K)) \rightarrow \neg\mathrm{inform}_I(K)$    (20)
When the dialog manager starts processing the utterance above, it infers the following:

• $\mathrm{inform}_I(K)$ with justification context $\mathrm{inform}_I(K)$

• $\mathrm{query}_I(K)$ with justification context $\mathrm{query}_I(K)$

As the semantics of "I want to" belongs to the class $W_I(\mathrm{do}_I(K))$, we can use this fact
(which has been derived from the semantic representation of the utterance) to infer
$\neg\mathrm{inform}_I(K)$ by (20). As a result, the only valid inference is $\mathrm{query}_I(K)$.
By cooperation, we can infer $W_R(\mathcal{K}_I(K))$. After that, we are in exactly the same
situation as if the speaker had asked directly for a flight to Rome.
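
The way a fact about the content of an utterance defeats one of the default hypotheses can be sketched in Python. The sketch treats formulas as plain strings and uses a lookup table for the overriding rule, which is a strong simplification of defeasible inference in the underlying logic. The function names are hypothetical.

```python
def surviving_hypotheses(hypotheses, facts, defeaters):
    """Drop every defeasible hypothesis that is ruled out by a known fact."""
    blocked = set()
    for fact in facts:
        blocked |= defeaters.get(fact, set())
    return [h for h in hypotheses if h not in blocked]


def cooperative_desire(act):
    """For query_I(K), return the responder's desire W_R(K_I(K)); otherwise None."""
    prefix = "query_I("
    if act.startswith(prefix) and act.endswith(")"):
        return f"W_R(K_I({act[len(prefix):-1]}))"
    return None


hypotheses = ["inform_I(K)", "query_I(K)"]       # default conclusions
defeaters = {"W_I(do_I(K))": {"inform_I(K)"}}    # the overriding rule for desires
facts = {"W_I(do_I(K))"}                         # derived from "I want to ..."

remaining = surviving_hypotheses(hypotheses, facts, defeaters)
print(remaining)                          # ['query_I(K)']
print(cooperative_desire(remaining[0]))   # W_R(K_I(K))
```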



8 Conclusions 

We have reported on work in progress on developing a theoretical approach to dialog structure
in order to build domain independent, cooperative, and robust dialog managers. The theoretical
framework outlined in this paper has already been implemented partially and was tested for a small
domain. In an evaluation, the implementation has proven to perform well. In addition, we have
analyzed a large corpus of train-inquiry dialogs collected with our EVAR system ([EckNie94]). An
important result is that the quality of speech recognition determines how cooperative the EVAR
system really is. To improve this situation we work towards improving dialog theory as presented in
this paper by integrating BDI-oriented, structural, and plan based approaches to dialog
understanding. Experience from analyzing dialogs that have not been terminated successfully has
shown that our approach is capable of overcoming the failures that caused the unacceptable
terminations.



9 Future Research 

Using the results presented in this paper as a basis, we will continue our work on dialog
structure by describing precisely how a processing level for handling typically difficult phenomena
of spoken language, such as repairs, can be integrated into our model. We intend to achieve this by
incorporating Traum's model of grounding (see [Tra94]) into our framework. This mechanism will
have to be extended to work properly on word hypothesis graphs, which are the basic data structure
of the common ground. This allows for a full exploitation of the results produced by the speech
recognizer. Furthermore, we want to integrate our implementation of a chunk parser as a robust
algorithmic approach to a theory of incremental discourse processing as described e.g. in [Poe94